Reconfigurable Graphics Processor for Performance Improvement

ABSTRACT

Power gating a portion of a graphics processor may be used to improve performance or to achieve a power budget. A processor granularity, such as a slice or subslice, may be gated.

CROSS-REFERENCE TO RELATED APPLICATIONS

This is a continuation application based on non-provisional application Ser. No. 13/993,696 filed on Jun. 13, 2013, hereby expressly incorporated by reference herein.

BACKGROUND

This relates generally to graphics processing in computer systems.

Graphics processors run under different processing conditions. In some cases, they can run in higher power consumption modes and in lower power consumption modes. It would be desirable to obtain the maximum performance possible, given the power consumption mode that the graphics processor operates within.

BRIEF DESCRIPTION OF THE DRAWINGS

FIG. 1 is a block depiction of one embodiment of the present invention;

FIG. 2 is a flow chart for another embodiment of the present invention;

FIG. 3 is a schematic depiction of one embodiment of the present invention;

FIG. 4 is a hypothetical graph of performance versus power budget for one embodiment of the present invention; and

FIG. 5 is a hypothetical graph of power budget versus time for one embodiment.

DETAILED DESCRIPTION

In some embodiments, graphics processing cores automatically reconfigure themselves to increase or maximize performance in both higher and lower power envelopes by dynamically power gating portions of the graphics processing engine. As used herein, power gating includes activating or deactivating a core portion.

While an example will be provided using a tablet computer graphics processor, the same concepts apply to any graphics processor.

A graphics processing core typically includes a number of execution units that perform arithmetic, logic, and other operations. A number of samplers may be used for texture processing. A sampler and a number of execution units together form a subslice. A number of subslices may be included in a particular graphics processing core, based on target performance and power budget. Subslices are combined to form a graphics processing slice. A graphics processing core may contain one or more slices. In a tablet graphics processing core, single slice and one, two, or three subslice designs are commonly used. Multiple slices are common in client graphics processors.

Thus, referring to FIG. 1, showing a typical graphics processor core, the core 10 includes a slice number 1, labeled 14, that may include a fixed function pipeline logic 16 and a number of subslices 18a and 18b. More slices and more or fewer subslices may be included in some embodiments. Also included in the graphics processor core is a fixed function logic 12.

The power and performance characteristics of one, two, and three subslice designs are different, as indicated in FIG. 4. Performance increases linearly up to a knee A (for example, around 2.5 Watts) of performance versus power dissipation, as one example. Below this knee, the graphics processor is operating in a frequency scaled region where graphics processor frequency can be raised without raising the operating voltage. Above the knee, graphics processor frequency is only raised if voltage is increased as well, which generally has a negative impact on power dissipation and results in a flatter plot of performance versus power dissipation than is experienced in the frequency scaled region.

One or more of the subslices of a graphics processor core may be power gated. Generally, the more subslices, the higher the performance, but the performance gap narrows as the available power budget shrinks, and there may be a point B in FIG. 4 (for example, at around 1.5 Watts) at which the single subslice configuration performs better than a two subslice configuration. This better performance is the result of the larger configuration having significantly more leakage power and, therefore, less room for dynamic power. In a lower power budget, less room for dynamic power can significantly limit the frequency and performance of the larger configuration, making it look less attractive than the smaller configuration.

In some embodiments, a power sharing mechanism may be used to achieve efficient dynamic power gating of graphics processor subslices. Of course, instead of gating subslice power consumption, the same concepts apply to dynamic power gating of any number of graphics processor slices in embodiments with more than one slice.

Graphics processors may have a power sharing function that basically increases (or decreases) power over time, as shown in FIG. 5. At a particular point in time t₁, a graphics processor core may be assigned, by a power control unit, a power budget TDP1 at a particular lower level that forces the graphics processor to operate at a particular frequency f1 that is the maximum frequency that allows the graphics processor not to exceed its allocated power budget. As the power budget is increased over time, the graphics core may operate at progressively higher frequencies.
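The budget-to-frequency relationship described above can be sketched as follows. This is a minimal illustration, not the power control unit's actual algorithm: the operating-point table, the capacitance and leakage values, and the names `total_power` and `max_frequency` are all hypothetical, and power is modeled simply as leakage plus C·V²·f.

```python
from typing import Optional, Sequence, Tuple

# Hypothetical (frequency in MHz, voltage in volts) operating points, sorted by
# frequency. The first three share one voltage (frequency scaled region); the
# last two need a higher voltage (above the knee).
OPERATING_POINTS = [(200.0, 0.70), (300.0, 0.70), (400.0, 0.70),
                    (500.0, 0.80), (600.0, 0.90)]


def total_power(freq_mhz: float, volts: float, cap_nf: float, leakage_w: float) -> float:
    """Leakage plus dynamic power C * V^2 * f, in watts (assumed model)."""
    return leakage_w + cap_nf * 1e-9 * volts ** 2 * freq_mhz * 1e6


def max_frequency(budget_w: float, cap_nf: float, leakage_w: float,
                  points: Sequence[Tuple[float, float]] = OPERATING_POINTS) -> Optional[float]:
    """Return the highest frequency whose total power fits the budget, or None."""
    best = None
    for freq_mhz, volts in points:
        if total_power(freq_mhz, volts, cap_nf, leakage_w) <= budget_w:
            best = freq_mhz
    return best
```

With these made-up numbers, raising the budget from 1.5 W to 2.0 W moves the selected frequency from 400 MHz to 500 MHz, mirroring the progressive frequency increase shown in FIG. 5.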

With subslice power gating, the power control unit knows ahead of time that the graphics processor core can be configured with a full complement of execution units and subslices or with fewer execution units and subslices. For example, one embodiment may include sixteen execution units and two subslices and another mode of operation may include eight execution units and one subslice. When the power budget available to the graphics processor is small, the graphics processor may be configured in the smaller core configuration with one of the two available subslices being power gated.

Generally, a subslice is not simply turned off at any particular point in time, as it may be executing active threads. When the power control unit makes the determination that a subslice should be power gated, the immediate action is to block new graphics processing threads from being scheduled on that subslice. Thus, it may take some time before the threads already executing on the subslice complete and the subslice becomes idle. Only then is the subslice actually power gated in one embodiment.
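The block-then-drain-then-gate ordering described above can be sketched as a small model. The `Subslice` class and its method names are hypothetical and exist only to illustrate the ordering; real hardware would track thread state very differently.

```python
from dataclasses import dataclass


@dataclass
class Subslice:
    """Toy model of one subslice (hypothetical names, illustration only)."""
    active_threads: int = 0
    accepting_threads: bool = True
    powered: bool = True

    def schedule(self) -> bool:
        """Try to start a new thread; refused once gating has been requested."""
        if self.powered and self.accepting_threads:
            self.active_threads += 1
            return True
        return False

    def retire(self) -> None:
        """One running thread completes."""
        if self.active_threads > 0:
            self.active_threads -= 1

    def request_gate(self) -> None:
        """Immediate action: block new threads, then gate if already idle."""
        self.accepting_threads = False
        self.try_gate()

    def try_gate(self) -> bool:
        """Power gate only when gating was requested and the subslice is idle."""
        if self.powered and not self.accepting_threads and self.active_threads == 0:
            self.powered = False
        return not self.powered
```

In this sketch, a subslice with two running threads stays powered after `request_gate()` and is gated only once both threads have retired and `try_gate()` is called again.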

With power gating, as the power budget is progressively increased, at some point a subslice that was initially turned off gets turned on. Conversely, as the processor progresses from higher to lower power budgets, a subslice may be turned off (as indicated in FIG. 5).

When a subslice is turned off, the frequency may increase or double (if one of two subslices is turned off). As a result, the performance can remain relatively stable, since the remaining subslice operates twice as fast as the two subslices. This frequency increase ensures a smooth (from a performance perspective) transition from the larger ungated graphics core to the smaller gated graphics core. Conversely, when a subslice is ungated and we transition to a two-subslice graphics core, the clock frequency reduces by half, to maintain overall performance at about the same level.
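The frequency adjustment at the gating event can be sketched as below, under the idealized assumption (implied by the doubling and halving described above, but not stated as a formula in the text) that throughput of the scalable logic is proportional to the number of active subslices times the clock frequency. The function name is hypothetical.

```python
def rescaled_frequency(freq_mhz: float, active_before: int, active_after: int) -> float:
    """Frequency that keeps (active subslices * frequency) constant across the transition."""
    if active_before <= 0 or active_after <= 0:
        raise ValueError("subslice counts must be positive")
    return freq_mhz * active_before / active_after
```

Going from two subslices at 300 MHz to one subslice gives 600 MHz, and the reverse transition from one subslice at 600 MHz gives 300 MHz, so the product of subslices and frequency is unchanged in both directions.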

The clock frequency changes described above are designed to not significantly disrupt (e.g., double or halve) the overall performance of the scalable portion of the graphics core (subslice logic shown in FIG. 1) at the point in time when power gating occurs. However, if the act of power gating has produced a more power-efficient graphics core due to its lower leakage dissipation, this would subsequently allow the graphics core to raise its clock frequency and power dissipation to fill its allocated power budget. This would lead to increased performance, which was the ultimate goal of power gating.

On the other hand, when the power budget allocated to the graphics core increases and allows for adding a subslice, reducing the clock frequency by half will initially preserve the same performance. However, given the increased graphics power budget, the graphics core will be then allowed to also raise its frequency, which will bring the desired result of raising performance.

Raising or reducing clock frequency in the process of dynamic power gating as described above works well for the scalable portion of the graphics core, i.e. the subslice shown in FIG. 1. If, however, the same clock is used by the non-scalable portion of the graphics core (e.g. the fixed function logic 12, shown in FIG. 1) then changing the clock frequency may affect, and potentially limit, the performance of that logic. This would not be desirable. To avoid that, the non-scalable logic may use its own independent clock which is not affected by clock frequency changes in the scalable graphics logic.

Switching from a larger configuration to a smaller configuration can improve performance because it provides leakage savings and makes room for more dynamic power. At the same time, switching from the larger to the smaller configuration may potentially lead to increased dynamic power since the frequency increases correspondingly. Therefore, the transition from larger to smaller configuration may happen when the leakage savings achieved exceeds the dynamic power cost due to the corresponding frequency increase. When that condition holds, there will be a net power savings by the transition and there is room to increase the frequency even further and achieve a net performance gain.

Thus, to give an example, with a sixteen execution unit, two subslice unit transitioning to an eight execution unit, one subslice unit as a result of power gating, the following Leakage Delta (LD) equations apply:

$$LD > f_{8}\, C_{8}\, V_{8}^{2} - f_{16}\, C_{16}\, V_{16}^{2} \tag{1}$$

$$LD > f_{8}\, AR_{8}\, C_{max8}\, V_{8}^{2} - f_{16}\, AR_{16}\, C_{max16}\, V_{16}^{2} \tag{2}$$

where f8 and f16 are the frequencies of the eight and sixteen execution unit configurations at the point in time when the power gating or ungating event occurs; V8 and V16 are the operating voltages of the two graphics processing cores when the power gating event occurs; C8 and C16 are the switching capacitances of the two graphics processing cores when the power gating event occurs; Cmax8 and Cmax16 are the maximum switching capacitances of the two graphics cores for a power virus workload; and AR16 and AR8 are the application ratios of the two cores right before and after the power gating or ungating event. The ‘Application Ratio’ of an application is defined as the ratio of the graphics core switching capacitance when that application executes on the core over the switching capacitance of the graphics core power virus.
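The two Leakage Delta conditions can be evaluated as in the following sketch. It reads equation (2) with C = AR × Cmax applied to both configurations, as the definitions of AR16 and Cmax16 suggest; the function names and the numeric values in the usage note are hypothetical, and units are assumed to be hertz, farads, volts, and watts.

```python
def dynamic_power(freq_hz: float, cap_f: float, volts: float) -> float:
    """Dynamic power f * C * V^2 in watts."""
    return freq_hz * cap_f * volts ** 2


def gating_saves_power(leakage_delta_w: float,
                       f8: float, c8: float, v8: float,
                       f16: float, c16: float, v16: float) -> bool:
    """Equation (1): leakage saved must exceed the added dynamic power."""
    return leakage_delta_w > dynamic_power(f8, c8, v8) - dynamic_power(f16, c16, v16)


def gating_saves_power_ar(leakage_delta_w: float,
                          f8: float, ar8: float, cmax8: float, v8: float,
                          f16: float, ar16: float, cmax16: float, v16: float) -> bool:
    """Equation (2): same test with C expressed as application ratio * Cmax."""
    return gating_saves_power(leakage_delta_w,
                              f8, ar8 * cmax8, v8,
                              f16, ar16 * cmax16, v16)
```

For example, with a made-up sixteen execution unit point of 300 MHz, 4 nF, 0.7 V and a target eight execution unit point of 600 MHz, 2.5 nF, 0.8 V, the dynamic power rises by about 0.37 W, so gating would pay off only if the leakage delta exceeded that amount.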

These equations may be used to make the decision to initiate subslice power gating or not. The package power-sharing mechanism, which may already be supported by the graphics processor, involves knowledge of the leakage power as a function of operating conditions, including die, voltage, and temperature and that is usually fused into the part, so that this information is already available. From that information, the leakage delta of the power-gated graphics core can be dynamically calculated as well, by simply scaling total leakage by the number appropriate when a subslice is power gated.

If the graphics processing core is currently configured as a sixteen execution unit, two subslice core, f16 and V16 are its current frequency and voltage and the target frequency f8 to switch to, after power gating the subslice, is then two times f16. The matching voltage V8 is also known ahead of time. The current switching capacitance, C16, can be estimated using turbo energy counters already available in some graphics processing engines. The maximum capacitance Cmax8 is a static quantity that is also known ahead of time and fused into the part.

Thus, the only quantities in the above two equations that are not known and cannot be directly calculated using the existing power-sharing infrastructure are the target switching capacitance C8 and the target application ratio AR8 of the smaller configuration that we want to switch to. These two quantities are essentially equivalent since one can be calculated from the other (C8=AR8*Cmax8).

One way to estimate C8 or AR8 is as follows. Silicon measurements taken with different graphics workloads may show that the application ratio of a workload running on a larger graphics core is lower than the application ratio of the same workload running on a smaller graphics core by a relatively predictable scale factor, such as 0.8× or 0.7×, for a wide range of workloads. So, one approach is to do a post-silicon characterization of a range of applications running on the power gated or un-gated graphics cores. The average sixteen execution unit versus eight execution unit application ratio scale factor can then be calculated and may be programmed as a static application ratio scale factor. While active in sixteen execution units mode, the graphics core can dynamically estimate its current application ratio using the available turbo energy counters and then project the application ratio AR8 that it would have if it operated in eight execution units mode by using the scale factor described above.
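The scale-factor projection can be sketched as follows. The sketch assumes the scale factor is defined so that AR16 = scale factor × AR8 (one reading of "lower ... by a scale factor such as 0.8×"), and it clamps the result to 1.0 on the assumption that an application ratio cannot exceed that of the power virus. The function names are hypothetical.

```python
def project_ar8(ar16: float, scale_factor: float) -> float:
    """Project the eight execution unit application ratio from the measured AR16.

    Assumes AR16 = scale_factor * AR8, so AR8 = AR16 / scale_factor.
    """
    if not 0.0 < scale_factor <= 1.0:
        raise ValueError("scale factor is expected to lie in (0, 1]")
    # Clamp: an application ratio above 1.0 would exceed the power virus (assumption).
    return min(1.0, ar16 / scale_factor)


def project_c8(ar16: float, scale_factor: float, cmax8: float) -> float:
    """Target switching capacitance from C8 = AR8 * Cmax8."""
    return project_ar8(ar16, scale_factor) * cmax8
```

With a measured AR16 of 0.4 and a programmed scale factor of 0.8, the projected AR8 is 0.5, and C8 follows by multiplying by the fused Cmax8 value.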

Alternatively, energy monitor counters can be used to correlate (via a curve-fitting method) the values of the energy counters to not only the capacitance of the current sixteen execution unit graphics core (C16) but also the capacitance of the target eight execution unit graphics core (C8) that we will switch to after power gating occurs. Once that capacitance is estimated, equation (2) can be used to make the power gating decision. This method may be more accurate than the previous one, but may involve more detailed and time consuming post-silicon characterization of the energy monitor counters for both the 16 and 8 execution unit configurations.
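One possible form of the curve-fitting step is an ordinary least-squares line relating a single energy counter reading to the capacitance of the target configuration. The text does not specify the fitting method or the number of counters, so the single-counter linear model, the function names, and the sample values below are all assumptions; the samples are synthetic placeholders, not measurements.

```python
from typing import Sequence, Tuple


def fit_line(xs: Sequence[float], ys: Sequence[float]) -> Tuple[float, float]:
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    if n < 2 or n != len(ys):
        raise ValueError("need at least two paired samples")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    if sxx == 0:
        raise ValueError("counter samples must not all be equal")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


# Synthetic characterization samples (placeholders): a counter reading taken in
# sixteen execution unit mode, paired with the capacitance (nF) observed for the
# same workload in eight execution unit mode.
COUNTER_SAMPLES = [1.0, 2.0, 3.0, 4.0]
C8_SAMPLES_NF = [0.75, 1.25, 1.75, 2.25]

SLOPE, INTERCEPT = fit_line(COUNTER_SAMPLES, C8_SAMPLES_NF)


def estimate_c8_nf(counter_value: float) -> float:
    """Estimate the target capacitance from a live counter reading."""
    return SLOPE * counter_value + INTERCEPT
```

A real characterization would likely use several counters and many workloads; the same fit would be repeated against capacitances observed in sixteen execution unit mode to obtain the C16 model.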

Once the decision to power gate has been taken and the transition from sixteen to eight execution units has been completed, the power may be measured and, therefore, the switching capacitance or application ratio in the new eight execution unit configuration is also determined. If that turns out to be significantly higher than estimated, then the power gating decision that was taken was wrong. In such case, the decision can be reversed, transitioning back to the larger configuration. If, on the other hand, the capacitance estimation of the smaller configuration was done correctly before power gating, then the extra dynamic power measured after the transition to the smaller configuration is less than the power savings. In that case, the new configuration may be maintained and the power sharing mechanism naturally pushes to a somewhat higher frequency, resulting from the net power reduction at iso-performance, providing a performance gain. Of course, the same considerations can be used to handle power gating of multiple subslices or slices.

In the case of activating a core portion, we may be transitioning from an eight execution unit graphics core to a sixteen execution unit graphics core in some cases. We can use equations (1) and (2) to ensure that the extra leakage of the sixteen execution unit graphics core will be lower than the dynamic power savings achieved by reducing the clock frequency by half. In that case, the clock frequency can subsequently be raised, which will increase performance.

FIG. 2 shows a sequence for making the power gating determination in accordance with some embodiments of the present invention. The sequence may be implemented in hardware, software, and/or firmware. In software and firmware embodiments, it may be implemented in computer executed instructions stored in a non-transitory computer readable medium, such as a magnetic, optical, or semiconductor storage.

In state 1 in this example, one subslice is active, as indicated at block 20. A check at diamond 22 determines whether the power control unit requests a new graphics processor turbo frequency. If so, a check at diamond 24 determines whether the conditions to turn on a second subslice are met. If not, the new graphics turbo frequency is set (block 26), as requested by the power control unit. If so, the second subslice is turned on. A tentative graphics processor frequency is set (block 28) and the decision to power ungate is then validated. If the validation is successful, as determined in diamond 30, the flow goes to state 2. If not, the subslice is power gated again, as indicated in block 32, and the processor returns to state 1.

In state 2, with two subslices active, as indicated at block 34, a check at diamond 36 determines whether the power control unit has requested a new graphics processor turbo frequency. If so, a check at diamond 38 determines whether the conditions to turn off a subslice have been met. If not, the new graphics processor turbo frequency is set (block 40) as requested. Otherwise, at block 42, thread scheduling on the target subslice is terminated. The sequence waits for the target subslice to become idle and, when it does so, turns off the target subslice. A tentative graphics frequency is set and then the decision to power gate is validated. If the decision is validated at diamond 44, the flow proceeds back to state 1. Otherwise, the subslice is powered up again, as indicated in block 46.
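The two-state sequence of FIG. 2 can be summarized as a small transition function. This is a sketch only: the names `State` and `step` are hypothetical, the three boolean inputs stand in for the checks at the diamonds, the returned action strings merely label the blocks, and returning to state 2 after a failed validation at diamond 44 is an assumption, since the text says only that the subslice is powered up again.

```python
from enum import Enum
from typing import List, Tuple


class State(Enum):
    ONE_SUBSLICE = 1   # state 1, block 20
    TWO_SUBSLICES = 2  # state 2, block 34


def step(state: State, turbo_requested: bool, switch_conditions_met: bool,
         validated: bool) -> Tuple[State, List[str]]:
    """Return the next state and the actions taken for one pass through FIG. 2."""
    if not turbo_requested:                      # diamonds 22 / 36
        return state, []
    if not switch_conditions_met:                # diamonds 24 / 38
        return state, ["set requested turbo frequency"]   # blocks 26 / 40
    if state is State.ONE_SUBSLICE:
        actions = ["turn on second subslice", "set tentative frequency"]  # block 28
        if validated:                            # diamond 30
            return State.TWO_SUBSLICES, actions
        return State.ONE_SUBSLICE, actions + ["power gate subslice again"]  # block 32
    actions = ["stop scheduling threads on target subslice",  # block 42
               "wait for target subslice to become idle",
               "turn off target subslice",
               "set tentative frequency"]
    if validated:                                # diamond 44
        return State.ONE_SUBSLICE, actions
    return State.TWO_SUBSLICES, actions + ["power up subslice again"]  # block 46
```

Calling `step(State.ONE_SUBSLICE, True, True, True)` moves to `State.TWO_SUBSLICES`, while a failed validation in either state leaves the configuration where it started after undoing the tentative change.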

The computer system 130, shown in FIG. 3, may include a hard drive 134 and a removable medium 136, coupled by a bus 104 to a chipset core logic 110. The computer system may be any computer system, including a smart mobile device, such as a smart phone, tablet, or a mobile Internet device. A keyboard and mouse 120, or other conventional components, may be coupled to the chipset core logic via bus 108. The core logic may couple to the graphics processor 112, via a bus 105, and the central processor 100 in one embodiment. The graphics processor 112 may also be coupled by a bus 106 to a frame buffer 114. The frame buffer 114 may be coupled by a bus 107 to a display screen 118. In one embodiment, a graphics processor 112 may be a multi-threaded, multi-core parallel processor using single instruction multiple data (SIMD) architecture.

In the case of a software implementation, the pertinent code may be stored in any suitable semiconductor, magnetic, or optical memory, including the main memory 132 (as indicated at 139) or any available memory within the graphics processor. Thus, in one embodiment, the code to perform the sequence of FIG. 2 may be stored in a non-transitory machine or computer readable medium, such as the memory 132, and/or the graphics processor 112, and/or the central processor 100 and may be executed by the processor 100 and/or the graphics processor 112 in one embodiment.

The following clauses and/or examples pertain to further embodiments:

One example embodiment may be a graphics processor comprising a memory, an interface logic, first and second independently gateable portions of the graphics processor, and logic to power gate a first portion and not the second portion of the graphics processor so that said first portion is powered on and said second portion is powered off. The processor may also include wherein one of said portions is a processor core. The processor may also include wherein both of said portions are processor cores. The processor may also include a power controller. The processor may also include a plurality of separate, identical processing units. The processor may also include wherein said processing units are independently power gated.

The graphics processing techniques described herein may be implemented in various hardware architectures. For example, graphics functionality may be integrated within a chipset. Alternatively, a discrete graphics processor may be used. As still another embodiment, the graphics functions may be implemented by a general purpose processor, including a multicore processor.

References throughout this specification to “one embodiment” or “an embodiment” mean that a particular feature, structure, or characteristic described in connection with the embodiment is included in at least one implementation encompassed within the present invention. Thus, appearances of the phrase “one embodiment” or “in an embodiment” are not necessarily referring to the same embodiment. Furthermore, the particular features, structures, or characteristics may be instituted in other suitable forms other than the particular embodiment illustrated and all such forms may be encompassed within the claims of the present application.

While the present invention has been described with respect to a limited number of embodiments, those skilled in the art will appreciate numerous modifications and variations therefrom. It is intended that the appended claims cover all such modifications and variations as fall within the true spirit and scope of this present invention.

What is claimed is:
 1. A graphics processor comprising: a memory; an interface logic; first and second independently gateable portions of the graphics processor; and logic to power gate a first portion and not the second portion of the graphics processor so that said first portion is powered on and said second portion is powered off.
 2. The processor of claim 1 wherein one of said portions is a processor core.
 3. The processor of claim 2 wherein both of said portions are processor cores.
 4. The processor of claim 1 including a power controller.
 5. The processor of claim 1 including a plurality of separate, identical processing units.
 6. The processor of claim 5 wherein said processing units are independently power gated.